Skip to content

StreamInterface: prevent socket/reader-thread leak on handshake failure#918

Open
nightjoker7 wants to merge 1 commit intomeshtastic:masterfrom
nightjoker7:fix-stream-handshake-leak
Open

StreamInterface: prevent socket/reader-thread leak on handshake failure#918
nightjoker7 wants to merge 1 commit intomeshtastic:masterfrom
nightjoker7:fix-stream-handshake-leak

Conversation

@nightjoker7
Copy link
Copy Markdown

@nightjoker7 nightjoker7 commented Apr 23, 2026

Problem

If StreamInterface.__init__ calls connect() or waitForConfig() and either raises (handshake timeout, config error, bad stream), the partially-constructed object propagates the exception up — but the reader thread spawned in connect() keeps running and the underlying stream/socket stays open. The caller never received a reference to the half-initialized instance, so they cannot call close() themselves.

In a reconnect loop (every Python client retrying after a flaky connection), each failed attempt leaks one socket + one thread. Compounds quickly under load.

This is consistent with the symptom shape in meshtastic/firmware#10101 (Station G2 TCP drops with MeshMonitor): the MeshMonitor client retries, retries leak fds, eventually the OS pressure shows up as connection refusal until the leaks time out.

Fix

Wrap the __init__ handshake calls in try/except. On failure, call self.close() (suppressed inside contextlib.suppress(Exception) so a secondary error doesn't shadow the original), then re-raise.

Also handle the case where close() is called from the cleanup path before connect() started the reader thread — _rxThread.join() raises RuntimeError when called on a never-started thread.

Tests added

Four unit tests in test_stream_interface.py:

  1. close() is safe when the reader thread was never started
  2. __init__ cleanup runs when connect() raises, original exception propagates
  3. __init__ cleanup runs when waitForConfig() raises, original exception propagates
  4. A failure in close() during cleanup does not shadow the original handshake failure

All passing locally, exercise the previously-uncovered patch lines.

Diff

meshtastic/stream_interface.py: +9 / -2 (the actual fix)
meshtastic/tests/test_stream_interface.py: +95 / 0 (the tests)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a resource-leak failure mode in StreamInterface where exceptions during initial protocol handshake (connect()/waitForConfig()) could leave a running reader thread and an open underlying transport (serial/TCP), even though the constructor raised before returning an instance to the caller.

Changes:

  • Wrap connect() + waitForConfig() in StreamInterface.__init__ with exception handling that attempts to close() before re-raising.
  • Make StreamInterface.close() tolerate RuntimeError from joining an unstarted reader thread.

Comment on lines +156 to +157
# Nothing to join; safe to ignore.
pass
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If connect() fails before the reader thread is started (e.g., serial write raises before _rxThread.start()), this RuntimeError path will skip joining but also won’t close the underlying stream. For SerialInterface (which opens the port before calling StreamInterface.__init__), this means the serial port can remain open even though close() was invoked. Consider explicitly closing self.stream (and setting it to None) when the thread was never started / is not alive, instead of just pass.

Suggested change
# Nothing to join; safe to ignore.
pass
# In that case the reader thread cannot run _disconnected(),
# so close the stream here to avoid leaking the resource.
if self.stream is not None:
with contextlib.suppress(Exception):
self.stream.close()
self.stream = None

Copilot uses AI. Check for mistakes.
Comment on lines 150 to +153
if self._rxThread != threading.current_thread():
self._rxThread.join() # wait for it to exit
try:
self._rxThread.join() # wait for it to exit
except RuntimeError:
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

join() without a timeout can block indefinitely if the reader thread is stuck in a blocking _readBytes() (e.g., TCPInterface uses blocking socket.recv() with no timeout). With the new __init__ exception handler calling self.close(), a handshake timeout against a silent TCP peer can hang here and never re-raise the original exception. Consider using a bounded join (and letting transport-specific close() logic interrupt the read), or otherwise ensuring blocking reads are interrupted before joining.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +78
try:
self.connect()
if not noProto:
self.waitForConfig()
except Exception:
# Handshake failed (timeout, config error, bad stream). The caller
# never receives a reference to this half-initialized instance, so
# they cannot call close() themselves. If we don't clean up here,
# the reader thread (already started by connect()) keeps running
# and the underlying stream/socket leaks — the leak compounds on
# every retry from the caller's reconnect loop.
with contextlib.suppress(Exception):
self.close()
raise
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New failure-path behavior (calling close() from __init__ when connect() / waitForConfig() raises) is important for leak prevention but currently isn’t covered by unit tests. Since there are existing tests for stream_interface.py, consider adding a test that forces connect() or waitForConfig() to raise and asserts cleanup occurs (e.g., close() is invoked and doesn’t raise when the thread wasn’t started).

Copilot generated this review using guidance from repository custom instructions.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 46.15385% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.52%. Comparing base (cec79a7) to head (9011baa).

Files with missing lines Patch % Lines
meshtastic/stream_interface.py 46.15% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   61.57%   61.52%   -0.06%     
==========================================
  Files          25       25              
  Lines        4448     4457       +9     
==========================================
+ Hits         2739     2742       +3     
- Misses       1709     1715       +6     
Flag Coverage Δ
unittests 61.52% <46.15%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…re in __init__

If connect() or waitForConfig() raises during __init__ (handshake timeout,
bad stream, config error), the reader thread started by connect() keeps
running and the underlying stream/socket stays open — but the caller never
receives a reference to the half-initialized instance, so they cannot call
close() themselves. The leak compounds on every retry from a caller's
reconnect loop.

Fix: wrap connect() + waitForConfig() in try/except; call self.close() on
any exception before re-raising. Also guard close() against RuntimeError
from joining an unstarted reader thread (happens when close() runs from
a failed __init__ before connect() could spawn it).

Discovered while debugging a real-world Meshtastic firmware crash where
a passive logger's retrying TCPInterface() calls against a node with
250-entry NodeDB produced a reconnect storm — every retry triggered a
full config+NodeDB dump on the node, compounding heap pressure, which
then exposed null-deref bugs in Router::perhapsDecode / MeshService
(firmware side fixed in meshtastic/firmware#10226 and #10229). The
client-side leak is independent of those firmware bugs and worth fixing
on its own.
@nightjoker7 nightjoker7 force-pushed the fix-stream-handshake-leak branch from 9011baa to bfe38ac Compare April 25, 2026 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants